Skip to content

Add env DeviceCopy::* and tests#7965

Open
gonidelis wants to merge 3 commits intoNVIDIA:mainfrom
gonidelis:copy_env
Open

Add env DeviceCopy::* and tests#7965
gonidelis wants to merge 3 commits intoNVIDIA:mainfrom
gonidelis:copy_env

Conversation

@gonidelis
Copy link
Member

@gonidelis gonidelis commented Mar 10, 2026

fixes #7539

This one is a bit peculiar.

  • DeviceCopy::Batched uses dispatch_with_env like other CUB env overloads — it has its own kernel dispatch via DispatchBatchMemcpy.

  • DeviceCopy::Copy does not launch its own kernels, but it rather delegates entirely to DeviceTransform::Transform and DeviceFor::ForEachInExtents, both of which already accept an env. The env is forwarded directly through
    detail::copy_mdspan::copy, which is generalized from cudaStream_t to a templated EnvT parameter.

    • Because Copy delegates to APIs that already emit NVTX ranges, the Copy env overload does not emit its own _CCCL_NVTX_RANGE_SCOPE to avoid nested NVTX range violations.

    • mdspan Copy env tests are only in _env_api.cu, not in _env.cu. The _env.cu file defines stream_registry_factory_t as the CUB_DETAIL_DEFAULT_KERNEL_LAUNCHER_FACTORY, which intercepts kernel launches to verify env usage. Since mdspan Copy does not launch its own kernels, the factory intercepts at the wrong level (inside Transform/ForEachInExtents), causing nvcc compilation failures. The _env_api.cu tests (which don't use the factory) provide sufficient coverage for the mdspan Copy env path.

@gonidelis gonidelis requested a review from a team as a code owner March 10, 2026 01:27
@gonidelis gonidelis requested a review from elstehle March 10, 2026 01:27
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 10, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 10, 2026
@github-actions

This comment has been minimized.

@gonidelis gonidelis enabled auto-merge (squash) March 10, 2026 03:22
@bernhardmgruber
Copy link
Contributor

Because Copy delegates to APIs that already emit NVTX ranges, the Copy env overload does not emit its own _CCCL_NVTX_RANGE_SCOPE to avoid nested NVTX range violations.

I think that's the wrong approach. We should emit an NVTX range at the first API entry into CUB. Later CUB algorithms should not emit NVTX ranges.

Comment on lines +233 to +234
typename EnvT = ::cuda::std::execution::env<>,
::cuda::std::enable_if_t<!::cuda::std::is_same_v<InputIt, void*>, int> = 0>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: We do not need a constraint here, since Batched here has 4-5 parameters, and the old overload has to 6-7, so they can never be ambiguous.

Suggested change
typename EnvT = ::cuda::std::execution::env<>,
::cuda::std::enable_if_t<!::cuda::std::is_same_v<InputIt, void*>, int> = 0>
typename EnvT = ::cuda::std::execution::env<>>

auto in_end = in_start + mdspan_in.mapping().required_span_size();
auto out_start = mdspan_out.data_handle();
auto out_end = out_start + mdspan_out.mapping().required_span_size();
_CCCL_ASSERT(!(in_end >= out_start && out_end >= in_start), "mdspan memory ranges must not overlap");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please retain the comment

Suggested change
_CCCL_ASSERT(!(in_end >= out_start && out_end >= in_start), "mdspan memory ranges must not overlap");
// TODO(fbusato): replace with __are_ptrs_overlapping
_CCCL_ASSERT(!(in_end >= out_start && out_end >= in_start), "mdspan memory ranges must not overlap");

Comment on lines +449 to +450
{
// no nvtx range because Copy delegates to Transform/ForEachInExtents which emit their own NVTX ranges
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: Please emit an NVTX range here.

Copy link
Member Author

@gonidelis gonidelis Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how? comment explains why i can't you explained it right underneath

Comment on lines +65 to +70
mdspan_in.size(),
::cuda::proclaim_copyable_arguments(::cuda::std::identity{}),
stream);
env);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: to avoid cub::DeviceTransform emitting an NVTX range here, maybe just call TransformInternal instead. If you rename TransformInternal yo __trasnform_internal you can make it public.

@@ -66,11 +67,11 @@ copy(::cuda::std::mdspan<T_In, E_In, L_In, A_In> mdspan_in,
mdspan_out.data_handle(),
mdspan_in.size(),
::cuda::proclaim_copyable_arguments(::cuda::std::identity{}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: You can remove proclaim_copyable_arguments as well, since it's not needed.

// TODO (fbusato): add ForEachInLayout when mdspan_in and mdspan_out have compatible layouts
// Compatible layouts could use more efficient iteration patterns
return cub::DeviceFor::ForEachInExtents(mdspan_in.extents(), copy_mdspan_t{mdspan_in, mdspan_out}, stream);
return cub::DeviceFor::ForEachInExtents(mdspan_in.extents(), copy_mdspan_t{mdspan_in, mdspan_out}, env);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: to avoid this API to emit an NVTX range, we have two options:

  1. We add an internal version of ForEachInExtents that does not emit the range
  2. We add a secret parameter to the env that is checked in ForEachInExtents and suppresses the NVTX range.

@github-actions
Copy link
Contributor

😬 CI Workflow Results

🟥 Finished in 1h 27m: Pass: 14%/249 | Total: 3d 07h | Max: 1h 10m | Hits: 86%/42649

See results here.

  - Rename TransformInternal to __transform_internal and make it public
  - Add DeviceFor::__for_each_in_extents_internal (no NVTX, bypasses Bulk)
  - Update dispatch_copy_mdspan to call internal APIs instead of public ones
  - Remove unnecessary proclaim_copyable_arguments wrapper
  - Add _CCCL_NVTX_RANGE_SCOPE to DeviceCopy::Copy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Add env-based API for cub::DeviceCopy

2 participants